Invisible Glue: Scalable Self-Tuning Multi-Stores
Abstract
Next-generation data-centric applications often involve diverse datasets, some very large while others may be of moderate size, some highly structured (e.g., relations) while others may have more complex structure (e.g., graphs) or little structure (e.g., text or log data). Facing them is a variety of storage systems, each of which can host some of the datasets (possibly after some data migration), but none of which is likely to be best for all, at all times. Deploying and efficiently running data-centric applications in such a complex setting is very challenging. We propose Estocada, an architecture for efficiently handling highly heterogeneous datasets based on a dynamic set of potentially very different data stores. Estocada provides to the application/programming layer access to each dataset in its native format, while hosting the datasets internally in a set of potentially overlapping fragments, possibly distributing (fragments of) each dataset across heterogeneous stores. Given workload information, Estocada self-tunes for performance, i.e., it automatically chooses the fragments of each dataset to be deployed in each store so as to optimize performance. At the core of Estocada lie powerful view-based rewriting and view selection algorithms, required in order to correctly handle the features (nesting, keys, constraints, etc.) of the diverse data models involved, and thus to marry correctness with high performance.

1. CONTEXT AND OUTLINE

Digital data is being produced at a fast pace and has become central to daily life in modern societies. Data is being produced and consumed in many data models, some of which may be structured (flat and nested relations, tree models such as JSON, graphs such as those encoding RDF data or social networks) and some of which may be less so (e.g., CSV or flat text files). Each of the data types above arises in application scenarios including traditional data warehousing, e-commerce, social network data analysis, Semantic Web data management, data analytics, etc.

It is increasingly the case that an application's needs can no longer be met within a single dataset or even within a single data model. Consider for instance a traditional customer relationship management (CRM) application. While a CRM typically needed to deal only with a relational data warehouse, the application now needs to incorporate new data sources in order to build better knowledge of its customers: (i) information gleaned from social network graphs about clients' activity and interests, and (ii) log files from multiple e-commerce stores, characterizing the clients' purchase activity in those stores. Monetizing access to operational databases is predicted to grow (Gartner predicts that 30% of businesses will do so by 2016: http://www.gartner.com/newsroom/id/2299315), thus access to such third-party data sources is increasing. The in-house RDBMS performs just fine on the relational data. However, the social graph data fits badly in that system, and the company attempts to store it in a dedicated graph store, until an engineer argues that it should be decomposed and stored into a highly efficient NoSQL key-value store system she has just experimented with.
The storage and processing of log files is delegated to a Hive installation (over Hadoop), until the summer research intern observes that recent work [14] has shown that some data from Hive should be lifted at runtime into the relational data warehouse to gain a few orders of magnitude of performance! Deploying and exploiting the CRM application for best performance is now set to be a nightmare. There is little consensus on which systems to use, if any; three successive engineers have recommended (and moved the social data into and out of) three different stores, one for graphs, one for key-value pairs, and the last an in-memory column database. Part of the log data has been moved into the in-memory column store, too, when the social data was stored there; this made their joint exploitation faster. But the whole log dataset could not fit in the single-node column store installation, and data migration fatigue had settled in before a suggestion was made (and rejected) to move everything to yet another cluster installation of the column store. The team working on the application feels battered and confused. The application is sometimes very slow. Migrating data is painful at every change of system; they are not sure the complete data set survived at each step, and data keeps accumulating. Yet, a new system may be touted as the most efficient for graph (or for log) data next week. How are they to tell the manager that no, they are not going to migrate the application to that system? Which part of the data, if any, should be deployed there? Would it be faster? Who knows?

In this work, we present Estocada, a platform we have started building, to help deploy and self-tune for performance applications having to deal with mixed-model data, relying on a dynamic set of diverse, heterogeneous data stores. While heterogeneous data integration is an old topic [16, 9, 5, 19], the remark that "one size does not fit all" [22] has been revisited for instance in the last CIDR [18, 12, 6], and the performance advantages brought by multi-stores have been recently noted, e.g., in [14]. Self-tuning stores have also been a topic of hot research. The set of features which, together, make Estocada novel are:

Natively multi-model. Estocada supports a variety of data models, including flat and nested relations, trees and graphs, including important classes of semantic constraints such as primary and foreign keys, inclusion constraints, redundancy of information within and across distinct storage formats, etc., which are needed to enforce application semantics.

Application-invisible. Estocada provides to client applications access to each dataset in its native format. This does not preclude other mapping/translation logic above Estocada's client API, but we do not discuss it in this paper. Instead, our focus is on efficiently storing the data, even if in a very different format from its original one, as discussed below.

Fragment-based store. Each dataset is stored as a set of fragments, whose content may overlap. The fragmentation is completely transparent to Estocada's clients, i.e., it is the system's task to answer queries based on the available fragments.

Mixed store. Each fragment may be stored in any of the stores underlying an Estocada installation, be it relational, tree- or graph-structured, based on key-value pairs, etc., centralized or distributed, disk- or memory-based, etc. Query answering must be aware of the constraints introduced implicitly when storing fragments in non-native models. For instance, when tree-structured data is stored in a relational store, the resulting edge relation satisfies the constraint that each node has at most one parent, and the descendant and ancestor relations are inverses of each other and are related nontrivially to the edge relation, etc.
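To make the preceding point concrete, such implicit constraints can be written as dependencies over the relational encoding. The following is a minimal sketch under an assumed encoding with relations Edge(parent, child), Desc(ancestor, descendant) and Anc(descendant, ancestor); these relation names are our own illustrative choices, not notation from the paper.

```latex
% Each node has at most one parent (an equality-generating dependency):
\forall p_1, p_2, c:\; \mathit{Edge}(p_1, c) \wedge \mathit{Edge}(p_2, c) \rightarrow p_1 = p_2

% Descendant and ancestor are inverses of each other:
\forall x, y:\; \mathit{Desc}(x, y) \leftrightarrow \mathit{Anc}(y, x)

% Desc contains Edge and extends along edges (tuple-generating dependencies
% relating Desc to Edge; the converse bound, stating that Desc is exactly the
% transitive closure of Edge, is not first-order expressible and hence not
% written as a dependency):
\forall x, y:\; \mathit{Edge}(x, y) \rightarrow \mathit{Desc}(x, y)
\forall x, y, z:\; \mathit{Desc}(x, y) \wedge \mathit{Edge}(y, z) \rightarrow \mathit{Desc}(x, z)
```

A rewriting engine that ignores such dependencies can miss valid rewritings or produce incorrect ones, which is why query answering over the fragments must take them into account.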
Self-tuning store. The choice of fragments and their placement is fully automatic. It is based on a combination of heuristics and cost-based decisions, taking into account data access patterns (through queries or simpler data access requests) as these become available.

View-based rewriting and view selection. The invisible glue holding all the pieces together is view-based rewriting with constraints. Specifically, each data fragment is internally described as a materialized view over one or several datasets; query answering amounts to view-based query rewriting, and storage tuning relies on view selection. Describing the stored fragments as views over the data allows changing the set of stores with no impact on Estocada's applications [9]; this simplifies the migration nightmare outlined above. Finally, our reliance on views gives a sound foundation to efficiency, as it guarantees the complete storage of the data, and the correctness of the fragmentation and query answering, among others.
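As a minimal, hypothetical illustration of the idea (the relation and fragment names below echo the CRM scenario of Section 1 but are invented for the example, not taken from the paper): a fragment materialized in, say, a key-value store can be described as a conjunctive view over two native datasets, and an application query posed against the native datasets can then be answered by rewriting it over the fragment.

```latex
% Fragment F, materialized in a key-value store, described as a view over the
% relational table Customer(id, email) and the social-graph relation
% Interest(email, topic):
F(c, t) \;\leftarrow\; \mathit{Customer}(c, e) \wedge \mathit{Interest}(e, t)

% Application query over the native datasets: interests of customers who
% placed at least one order.
Q(c, t) \;\leftarrow\; \mathit{Customer}(c, e) \wedge \mathit{Interest}(e, t) \wedge \mathit{Order}(o, c)

% Equivalent rewriting over the stored fragments: the first atom is answered
% by the key-value store holding F, the second by the relational store
% holding Order.
Q(c, t) \;\leftarrow\; F(c, t) \wedge \mathit{Order}(o, c)
```

View selection is the converse problem: given the workload, choose which such fragments F to materialize and in which store to place them.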
Technical challenges. The Estocada scenario involves the coexistence of a large number of materialized views mixing data formats (modeling the native sources), with significant redundancy between them (due to repeated migration and view selection arising organically over the history of the system, as opposed to clean-slate planning). While the problem of rewriting using views is classical, it has typically been addressed and practically implemented only in limited scenarios that do not apply here. These scenarios feature (i) only relatively small numbers of views; (ii) minimal overlap between views, as their selection is planned ahead of time; (iii) views expressed over the same data model; (iv) rewriting that exploits only limited integrity constraints (typically only keys/foreign keys in existing systems). The large number of views and their redundancy notoriously contribute (at least) exponentially to the explosion of the search space for rewritings, even when working within a single data model.

In the sequel, we introduce some motivating scenarios, present Estocada's architecture, and walk the reader through the main technical elements of our solution, by means of an example. Finally, we discuss related works, then conclude.

2. APPLICATION SCENARIOS

We now present two typical scenarios which stand to benefit from our proposal. They are inspired by real-world applications being built by several partners, including us, within the French R&D project Datalyse on Big Data analytics (http://www.datalyse.fr).

Open Data warehousing for Digital Cities. The application is based on Open Data published and shared in a digital-city context. The purpose is to predict traffic flow and the consequent customer behavior, taking into account information about events that influence people's behavior, such as city events (e.g., celebrations, demonstrations, or a highly attended show or football match), weather forecasts (bad weather often leading to traffic slowdowns), etc. The analysis is performed in the metropolitan area of a 600,000-strong French city.

The data used in the project comes from city administrations, public services (e.g., weather and traffic data), companies, and individuals in the area, through Web-based and mobile applications; the sources are heterogeneous, comprising RDF, relational, JSON and key-value data. More precisely, (i) open data about traffic and events is encoded in RDF; (ii) city events information is published as JSON documents; (iii) social media data holding users' locations, as well as their notifications of public transport events (such as delayed buses or regional trains), is organized in key-value pairs; (iv) weather data is generated as relational tuples. The application comprises the following queries:

• Estimated concentration of people and vehicles in a particular area at a particular moment. One use of such information is to plan corrective/preventive measures, for instance diverting traffic from an overcrowded area; it can also be used to identify business opportunities according to where people are. This query requires: traffic information, social events, social network information, the weather (open-space events may be cancelled depending on the weather) and public transport information.

• Most popular trajectories in the area, which can be used to optimize public transport or car traffic routes. This needs traffic information, social network data on traffic events, and public transportation timetables.

Large-scale e-commerce application. We consider a scenario from a large-scale e-commerce application whose goal is to maximize sales while improving the customer experience. A large retailer wants to use the clients' social network activity together with the e-commerce application logs, to improve targeted product recommendation. This requires exploiting the data produced by the users both actively (orders, product reviews, etc.) and passively (logs), as well as social network data. The heterogeneous sources of information in this scenario are: (i) shopping cart data stored in key-value pairs; (ii) the product catalog, structured in documents using a data store with full-text search, faceted search and filtering capabilities; (iii) orders, stored in a relational data store; (iv) user data (such as birth date, gender, interests, delivery addresses, preferences, etc.) organized in documents; (v) product reviews and ratings, structured in a partitioned row store; and (vi) social network data organized in key-value pairs. The queries in this scenario:

• Retrieve items to recommend to each user, by displaying them on the user's homepage and inside bars while the user is shopping. This requires combining cart information, the product catalog, past relational sale data, personal user data, product reviews, and possible recommendations gleaned from the social network.

• Improve product search: this requires attribute-based search in the product catalog, as well as the user's recent search and purchase history to decide which products to return at the top of the search result.

This application uses massive data (the sale history of many users), yet response time is critical, because query answers must be made available in the user's Web interface. To get such performance, the engineers in charge of the application have decided to use memcached [23] to make access to parts of the database very fast. However, automatically deciding which parts to put in memcached, and correctly computing results out of memcached and the other external sources, is challenging, especially given the heterogeneity of the data representation formats.
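To illustrate the kind of caching logic involved, here is a minimal sketch in Python using the pymemcache client. The key scheme, the fetch_from_store function and the choice of what to cache are hypothetical placeholders for this example, not Estocada's actual mechanism (which, as described above, relies on view selection and constraint-aware rewriting).

```python
# Minimal sketch: serving a recommendation sub-result either from memcached
# or from the backing stores, then caching it for subsequent requests.
# Assumes a memcached server on localhost:11211 and the pymemcache package.
import json

from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
TTL_SECONDS = 300  # how long a cached fragment stays valid


def fetch_from_store(user_id):
    """Hypothetical placeholder for the expensive multi-store computation,
    e.g., joining relational sale history with social-network data."""
    return {"user_id": user_id, "recommended_items": ["sku-1", "sku-2"]}


def recommendations(user_id):
    key = f"reco:{user_id}"           # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)      # fast path: answer from memcached
    result = fetch_from_store(user_id)
    cache.set(key, json.dumps(result), expire=TTL_SECONDS)
    return result


if __name__ == "__main__":
    print(recommendations(42))
```

The hard part, which Estocada aims to automate, is deciding which parts of the data deserve such treatment and guaranteeing that answers assembled from the cache and the other heterogeneous stores are correct.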